Source: https://www.kaggle.com/zynicide/wine-reviews/downloads/wine-reviews.zip/4

My dataset is comprised of almost 130,000 reviews of invididual wines organized price, rating on a 0-100 point scale, nationality, type, year, taster, and winery of origin. Wine snobs annoy me, so I wanted to see if anything they have to say about quality holds water statistically.

data <- read.csv(file="winemag-data-130k-v2.csv")
data

Common assertions about wine include a relationship between price and quality, the statement that “X was a good year for Y wine,” and the idea that certain countries make better wines. I’m going to explore these relationships using this database.

First, the columns of the table.

colnames(data)
##  [1] "X"                     "country"              
##  [3] "description"           "designation"          
##  [5] "points"                "price"                
##  [7] "province"              "region_1"             
##  [9] "region_2"              "taster_name"          
## [11] "taster_twitter_handle" "title"                
## [13] "variety"               "winery"

These can be refined or removed to add clarity. X is wholly unnecessary in this environment, denoting a row ID, while description and designations’ use as qualitative data is irrelevant in the context of this paper.

data <- data %>% mutate(X=NULL, description=NULL, designation=NULL)

This removes those three columns from the table.

Next, what makes a good wine according to the data? Sorting mean rating by country and province of origin is easy enough.

mean_score_nationality <- data %>% select(points, country) %>% group_by(country)%>% summarize(score=mean(points))
mean_score_nationality

According to this, England produces the best wine on average, but a graphical aid would better display the differences between countries.

  mean_score_nationality %>% ggplot(aes(x=country, y=score)) + geom_bar(stat="identity", width=.5) + labs(x="country", y="mean rating")

This shows that while differences exist in ratings by nationality, the typical magnitude of that difference is relatively small. The power of data science is that a few lines of code can reduce an otherwise-insurmountable quantity of measurements to a graphic or figure parseable by the unaided eye.

How does price affect rating? This question is better suited to a linear regression of the data to find the relationship between the two continuous variables.

data %>% ggplot(aes(x=price,y=points))+geom_point()+geom_smooth(method=lm)+labs(x="price", y="rating")
## Warning: Removed 8996 rows containing non-finite values (stat_smooth).
## Warning: Removed 8996 rows containing missing values (geom_point).

The regression line on the graph looks off, so let’s gather some information about the relationship.

lmfit <- lm(points ~ price, data)
lmfit
## 
## Call:
## lm(formula = points ~ price, data = data)
## 
## Coefficients:
## (Intercept)        price  
##    87.32964      0.03089

The linear regression predicts that for every increment in price, the rating of the wine increases by .03. However, the graph from earlier looked like the bulk of the data was significantly below the line. The broom package provides functions for measuring the strength of a regression relationship.

tidy(lmfit)
glmfit <- glm(points ~ price, data, family="poisson")
tidy(glmfit)

The standard error on the logarithmic fit is orders of magnitude lower than the linear fit, suggesting that the relationship is more accurately represented in this manner.

What about the certain years being good for wines? The title column contains the year nestled between other, less interesting factoids regarding the type of wine. To access the year in question, we can scrape the year out of the attribute using stringr functions and feeding the data back into the existing dataframe with tidyr’s mutate.

data = data %>% mutate(years = as.numeric(str_extract(title, "\\d{4}")))
data %>% select(years, points)

Now that the years have been scraped from applicable entries, let’s create a boxplot to compare mean scores.

data  %>% filter(data$years >=1850 & data$years<1870) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1870 & data$years<1890) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1890 & data$years<1910) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1910 & data$years<1930) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1930 & data$years<1950) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1950 & data$years<1970) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1970 & data$years<1990) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=1990 & data$years<2010) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

data  %>% filter(data$years >=2010 & data$years < 3000) %>% group_by(years) %>% ggplot(aes(factor(years), points)) + geom_boxplot()

Excluding years where there were one or fewer wines reviewed, notable years for above-average quality include 1927, 1952, 1980, and 1984. This demonstrates the potency of data science, as I’ve gone from knowing nothing about wine to knowing how nationality affects quality(not by much), how the price of a bottle affects its rating(it does, up to a point), and that when sommeliers say that the grapes were good in 1952, they might actually be onto something. Thank you for reading.